System and method for analyzing electronic message activity

ABSTRACT

Systems and methods for analyzing electronic message activity are disclosed. An example method includes determining a relevancy ranking for each message in a received set of electronic messages, wherein the relevancy ranking indicates whether each message is relevant to a movie, determining an opinion expressed in each message, and computing a prediction of the success of the movie based on the determined opinion for each message.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/710,743, filed Feb. 26, 2007, which is a continuation of U.S. patent application Ser. No. 11/239,632, filed Sep. 28, 2005, and granted as U.S. Pat. No. 7,188,078 on Mar. 6, 2007, which is a divisional of U.S. patent application Ser. No. 09/686,516, filed Oct. 11, 2000, and granted as U.S. Pat. No. 7,197,470 on Mar. 27, 2007, the disclosure of which is incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure relates generally to data collection, organization and analysis and, more particularly, to the collection, categorization and analysis of electronic discussion messages.

BACKGROUND

Electronic discussion forums have been used in the art to facilitate communications between two or more people. Such electronic discussion forums typically allow for exchange of information, ideas and opinions over an extended period of time, i.e., a discussion about a particular topic may be initiated by an individual posting a message on day one, and subsequent discussion participants may receive, view or respond to the message at a later date. Such discussion forums allow even participants new to the forum to review past discussion messages and therefore to fully participate in the forum. Well-known examples of such electronic forums include Web-based and proprietary message boards (both public and private), USENET news groups, and electronic mailing lists. These electronic discussion forums support both synchronous and asynchronous discussions, i.e., one or more participants may inject communications into the discussion at the same time, or nearly the same time, without disrupting the flow of communications. This allows each individual electronic discussion forum to be rich with communications spanning a wide variety of topics and subjects.

Other electronic discussion forums, such as interactive chat sessions, facilitate more traditional synchronous-like communications. In these discussion forums, participants are typically online at the same time and are actively responding to messages posted by others. These discussion forums are similar to a traditional telephone discussion in that the information is exchanged in real time. However, a significant difference is that the electronic discussion forums are, by their nature, written or recorded message transmissions which may be saved for historical records or for analysis at a future date.

The widespread growth of the Internet has spurred numerous electronic communities, each providing numerous discussion forums dedicated to nearly any conceivable topic for discussion. The participants in a particular discussion may be geographically dispersed with worldwide representation or may be primarily localized, depending on the topic or distribution of the forum. For example, a mailing list devoted to planning for city parks in New York City may be only of interest to people having strong ties to the city or region, while a message board devoted to a particular programming language may have participants spanning the globe.

With so many different topics and subjects within each topic, and so many participants, a significant problem arises in attempting to capture and quantify the communications. Moreover, identifying trends and predicting future behavior in certain markets based on the communications has not been possible in the past because of the magnitude of the communications and the magnitude of topics and subjects. Further complicating any analysis of communications in electronic discussion forums is the fact that an individual may easily participate in multiple forums by posting the same message in several different discussion forums, and that individuals may use more than one identity when posting.

SUMMARY

The system and method of the present disclosure allow collection and analysis of electronic discussion messages to quantify and identify trends in various markets. Message information data is collected and becomes a time series stored in a database, indicating the identity or pseudonym of the person posting the message, the contents of the message and other data associated with the message. This data is analyzed to identify when new participants enter and leave the discussion and how often they participate. Calculation of summary statistics describing each community's behavior over time can also be made. Finally, identification of patterns in this data allows identification of pseudonyms that play various roles in each community, as described below.

The system of the present disclosure comprises an electronic discussion data system, a central data store and a data analysis system. The electronic discussion data system may comprise a message collection subsystem as well as message categorization and opinion rating subsystems. The message collection subsystem interfaces with a plurality of pre-determined electronic discussion forums to gather message information. The message categorization subsystem analyzes the message information and categorizes each message according to a plurality of pre-determined rules. The opinion rating subsystem further analyzes the message information and assesses an opinion rating according to a plurality of pre-determined linguistic and associative rules. The central data store of the present disclosure comprises one or more non-volatile memory devices for storing electronic data including, for example, message information, results of analyses performed by the system and a plurality of other information used in the present disclosure. In a preferred embodiment, the central data store further comprises a relational database system for storing the information in the non-volatile memory devices. The data analysis system of the present disclosure may comprise an objective data collection subsystem, an analysis subsystem, and a report generation subsystem. The objective data collection subsystem interfaces with a plurality of pre-determined objective data sources to collect data which may be used to establish trends and correlation between real-world events and the communication expressed in the various electronic discussion forums. The analysis subsystem performs the analysis of the objective data and message information described above. The report generation subsystem generates reports of the analysis to end-users. The reports may comprise pre-determined query results presented in pre-defined report formats or, alternatively, may comprise ad hoc reports based on queries input by an end-user.

The method of the present disclosure comprises one or more of the steps of collecting a plurality of message information from a plurality of pre-determined electronic discussion forums; storing the plurality of message information in a central data store; categorizing the message information according to a plurality of pre-determined rules; assigning an opinion rating to the plurality of message information based on a plurality of pre-determined linguistic patterns and associative rules; collecting a plurality of objective data from a plurality of objective data sources; analyzing the message information and the objective data to identify trends in the pattern of behavior in pre-determined markets and the roles of participants in electronic discussion forums; and generating reports for end-users based on the results of the analyses performed by embodiments of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the system architecture employed in a preferred embodiment of the present disclosure.

FIG. 2 is a schematic diagram of a message collection subsystem implemented in a preferred embodiment of the present disclosure.

FIG. 3 is a schematic diagram of the hierarchy used to categorize messages in a preferred embodiment of the present disclosure.

FIG. 4 is an example of a graphical report output by a report generation subsystem of the present disclosure.

FIG. 5 is a schematic diagram of an embodiment of the present disclosure comprising a pseudonym registration and tracking service.

DEFINITIONS

Community—a vehicle supporting one or more electronic discussions, such as a message board, mailing list, or Usenet newsgroup.

Discussion Forum—an area of a community where discussions directed to a particular theme occur. Examples of discussion forums include the Amazon message board on Yahoo Message Boards and the Usenet newsgroup rec.arts.movies.current-films.

Message—the text and associated information posted to discussion forums, also referred to herein as “electronic message”.

Topics—the themes designated for discussion in a discussion forum by a particular community.

Subject—the contents of the “Subject” field in an electronic message posted in an electronic discussion (as distinct from topics).

Pseudonym—an e-mail address, alias, or other name used by a participant in an electronic discussion forum. A pseudonym is an end-user's identity in a particular community.

Source—the issuer of a pseudonym, such as an e-mail host.

Message Body—the portion of an electronic message comprising the pseudonym's contribution to the electronic discussion. The Message Body generally comprises the data, opinions, or other information conveyed in the electronic message, including attached documents or files.

Header Information—the portion of an electronic message not including the message body. Header Information generally comprises the transmission path and time/date stamp information, the message sender's information, the message identification number (“message ID”), and the subject.

Buzz Level—for a community, a measure of activity within the community, as determined by the number of distinct pseudonyms posting one or more messages over a given time frame.

Connectivity—for a community, a measure of its relatedness with other communities, as determined by the number of other communities in which a community's participants concurrently participate.

Actor—descriptive name of the role that a pseudonym plays in the social networks of communities. Actors can be further classified according to the following definitions:

Initiator—a pseudonym that commences a discussion, i.e., one that posts the first message leading to subsequent responses forming a dialog on a particular subject.

Moderator—a pseudonym that ends a discussion, i.e., one that posts the final message closing the dialog on a particular subject.

Buzz Accelerator—a pseudonym whose postings tend to precede a rising buzz level in a community.

Buzz Decelerator—a pseudonym whose postings tend to precede a falling buzz level in a community.

Provoker—a pseudonym that tends to start longer discussion threads; different from buzz accelerators in that the metric is one discussion thread, not the community's overall discussion level.

Buy Signaler—a pseudonym whose postings on a topic tend to precede a rising market for that topic.

Sell Signaler—a pseudonym whose postings on a topic tend to precede a falling market for that topic.

Manipulator—a pseudonym with little posting history except as Manipulators, whose combined postings on one topic elevate the buzz level in the absence of external confirming events.

Connector—a pseudonym who posts on a high number of topics or a high number of communities.

Market Mood—a positive/negative market forecast derived from analysis of the patterns of actors' behavior.
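By way of illustration only, the Buzz Level and Connectivity definitions above can be sketched as simple set computations. The tuple layout, the sample data, the half-open time window and the function names below are assumptions made for this sketch and are not taken from the disclosure.

```python
from datetime import datetime

# Hypothetical message records: (community, pseudonym, posting time).
MESSAGES = [
    ("board_a", "alice", datetime(2000, 10, 1, 9, 0)),
    ("board_a", "bob", datetime(2000, 10, 1, 10, 0)),
    ("board_a", "alice", datetime(2000, 10, 1, 11, 0)),
    ("board_b", "alice", datetime(2000, 10, 1, 12, 0)),
    ("board_c", "carol", datetime(2000, 10, 2, 9, 0)),
]


def buzz_level(messages, community, start, end):
    """Number of distinct pseudonyms posting in `community` within [start, end)."""
    return len({p for c, p, t in messages if c == community and start <= t < end})


def connectivity(messages, community, start, end):
    """Number of other communities in which this community's participants
    also posted within [start, end)."""
    window = [(c, p) for c, p, t in messages if start <= t < end]
    members = {p for c, p in window if c == community}
    return len({c for c, p in window if c != community and p in members})


start, end = datetime(2000, 10, 1), datetime(2000, 10, 2)
print(buzz_level(MESSAGES, "board_a", start, end))
print(connectivity(MESSAGES, "board_a", start, end))
```

Counting distinct pseudonyms rather than messages follows the Buzz Level definition, so a single participant posting repeatedly does not raise the measure.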

DETAILED DESCRIPTION

In a preferred embodiment, the present disclosure is implemented using a system architecture as shown in FIG. 1. The system architecture comprises electronic discussion data system 10, central data store 20, and data analysis system 30. Electronic discussion data system 10 interfaces via network 4 with selected electronic discussion forums 6 to collect electronic messages and analyze intrinsic data comprising the messages according to one aspect of the present disclosure. Network 4 may be any communications network, e.g., the Internet or a private intranet, and may use any suitable protocol for the exchange of electronic data, e.g., TCP/IP, NNTP, HTTP, etc. Central data store 20 is a repository for electronic messages collected, objective data gathered from external sources and the results of the various analyses or reports produced by the system and method of the present disclosure. Central data store 20 may be implemented using any suitable relational database application program, such as, e.g., Oracle, Sybase and the like. Data analysis system 30 receives input from selected objective data sources for use in analyzing and quantifying the importance of the electronic discussion messages collected, and provides computer programming routines allowing end-users 9 to generate a variety of predefined and ad hoc reports and graphical analyses related to the electronic discussion messages. Each of the main systems comprising the system architecture of the present disclosure is described in more detail below.

Central Data Store

Central data store 20 comprises one or more database files stored on one or more computer systems. In a preferred embodiment, central data store 20 comprises message information database 22, topics database 23, objective data database 24, forum configuration database 25, analysis database 26 and reports database 27, as shown in FIG. 1. Message information database 22 comprises the message information collected by message collection subsystem 12. In a preferred embodiment, message information database 22 comprises: a message ID, i.e., a number or other string that uniquely identifies each message; sender information, i.e., the pseudonym, e-mail address or name of each message's author; a posting time and date for each message (localized to a common time zone); a collection time and date for each message; a subject field, i.e., the name of the thread or subject of each message; the message body for each message; an in-reply-to field, i.e., the message ID of the message to which each message was a reply; and the source of the message.

The function and content of central data store 20's database files 23-27 are described in subsequent sections below.
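By way of illustration only, the fields enumerated above for message information database 22 could be modeled as a single record type. The class name, field names, helper function and sample values below are hypothetical; the disclosure specifies only the kinds of information stored, not a schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MessageRecord:
    """One row of a hypothetical message information table."""
    message_id: str             # uniquely identifies the message
    sender: str                 # pseudonym, e-mail address or name of the author
    posted_at: datetime         # posting time, normalized to a common time zone
    collected_at: datetime      # time the collector retrieved the message
    subject: str                # thread name or subject line
    body: str                   # message body
    in_reply_to: Optional[str]  # message ID being replied to, if any
    source: str                 # where the message came from


def collection_lag_seconds(record: MessageRecord) -> float:
    """Seconds elapsed between posting and collection."""
    return (record.collected_at - record.posted_at).total_seconds()


example = MessageRecord(
    message_id="<1234@example.invalid>",
    sender="alice",
    posted_at=datetime(2000, 10, 11, 12, 0, tzinfo=timezone.utc),
    collected_at=datetime(2000, 10, 11, 13, 0, tzinfo=timezone.utc),
    subject="The Perfect Storm",
    body="Saw it last night.",
    in_reply_to=None,
    source="example.invalid",
)
print(collection_lag_seconds(example))
```

Storing both timestamps in one time zone, as the text describes, keeps comparisons such as the lag above free of offset arithmetic.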

Electronic Discussion Data System

As discussed above, electronic discussion data system 10 gathers certain messages and analyzes them according to the intrinsic information comprising the messages. Electronic discussion data system 10 comprises three subsystems: message collection subsystem 12, message categorization subsystem 14 and opinion rating subsystem 16. Message collection subsystem 12 collects message information from data sources and stores the information in central data store 20 for later analysis. Message categorization subsystem 14 extracts information about each message in central data store 20 and categorizes the messages according to a plurality of pre-defined topics. The subsystem analyzes all aspects of each message and determines if the message is relevant to one or more of the topics that the system is currently tracking. A relevancy ranking for each message is stored in central data store 20 for each topic indicating the strength of the message's relation to each topic. Further analysis of the collected message information is carried out by opinion rating subsystem 16 to determine whether the message conveys a positive, neutral or negative opinion regarding the related topic. Each of the subsystems of electronic discussion data system 10 is described in more detail below.

1. Message Collection Subsystem

Message collection subsystem 12 collects electronic message information from the designated electronic discussion forums and passes the collected messages to central data store 20 and to message categorization subsystem 14, as shown in FIG. 1. The collected messages comprise records stored in message information database 22 in central data store 20. Database 22 comprises records including message header information and the message body. In a preferred embodiment, each field comprising message header information comprises a separate field of a record in database 22. The architecture used in a preferred embodiment of the present disclosure for implementing message collection subsystem 12 is shown in the schematic diagram in FIG. 2. This architecture supports multiple configurations for data collection and is highly scalable for gathering large or small amounts of message information. FIG. 2 illustrates some of the configurations that may be used in a preferred embodiment of message collection subsystem 12.

As shown in FIG. 2, the message collection subsystem consists of several components that function together to collect information from electronic discussion forums 61 and 62 or discussion data files 63 and 64 on distributed networks 41-44. Although shown as separate discussion forums, data files and networks, it would be apparent to one skilled in the art that discussion forums 61 and 62 and data files 63 and 64 could be the same discussion forum or data file, and networks 41-44 could comprise a single distributed network, such as the Internet. Components of message collection subsystem 12 include message collector programs and message processor programs running on one or more computer systems. The computer systems used by message collection subsystem 12 comprise any suitable computers having sufficient processing capabilities, volatile and non-volatile memory, and support for multiple communications protocols. In a preferred embodiment, the computer systems used by message collection subsystem 12 comprise UNIX-based servers such as those available from Sun Microsystems, Hewlett-Packard and the like. All of the subsystem components can be replicated within a single computer system or across multiple computer systems for overall system scalability.

In a preferred embodiment, message processor programs, e.g., message processors 121 a and 121 b, are in communication with database 22, which is part of central data store 20 (not shown in FIG. 2). In FIG. 2, the message processors and central data store are protected from unauthorized access by firewall security system 122. Other components of message collection subsystem 12 are located at various points in the architecture, as described below. As would be apparent to one of ordinary skill in the art, firewall 122 is provided for security and is not technologically required for operation of the present disclosure. Message processors 121 a and 121 b receive information from the message collectors and store the information in database 22 for later processing. As shown in FIG. 2, message processors 121 a and 121 b may service more than one message collector program to facilitate processing of a large volume of incoming messages. Inbound messages are held in a queue on the message processors, allowing message processors 121 a and 121 b to receive many more messages from the message collectors than they can immediately process for storing in database 22. This architecture allows the rapid collection of millions of messages from tens of thousands of discussion forums without excessive overloading of the computer systems.

As is known in the art, each discussion forum or data file may have a unique message format. For example, an electronic message from one discussion forum may place the date field first, the message ID second, and the other header and body data last. A different discussion forum may choose to display the message ID first, followed by the pseudonym of the participant, and the message body. Moreover, each type of discussion forum has its own communications protocol. For example, the communications protocol for an interactive discussion forum (e.g., a chat session) is not the same as the communications protocol for USENET news groups. The message format and protocols need not be static, i.e., as discussion forums evolve, different data structures and protocols may be implemented. To accommodate such changes, each message collector receives configuration information from forum configuration database 25 in central data store 20, either directly or via the message processor systems. The configuration information indicates the data source, i.e., the discussion forum or discussion file, from which messages will be collected. The configuration information further comprises programming instructions tailored for each individual data source to allow the message collector program to communicate with the data source and extract and parse the message information. Accordingly, message collectors can support a wide variety of protocols utilized by discussion forums including, e.g., HTTP, NNTP, IRC, SMTP and direct file access. In a preferred embodiment, the general programming instructions are written in the Java programming language with parsing instructions written in Jpython scripting language. By storing the configuration information in a centralized location, i.e., central data store 20, management of the message collectors is simplified. Accordingly, when the data structure for a particular discussion forum changes, the configuration information needs to be modified only once.
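By way of illustration only, the idea of per-forum configuration driving the parsing step can be sketched as follows. The forum names, delimiters, field orders and function name are hypothetical; the disclosure stores tailored programming instructions per data source, which this sketch reduces to a field-order table for brevity.

```python
# Hypothetical stand-in for entries in a forum configuration table:
# each forum declares its own delimiter and field order.
FORUM_CONFIG = {
    "forum_a": {"delimiter": "|", "fields": ["posted", "message_id", "sender", "body"]},
    "forum_b": {"delimiter": "\t", "fields": ["message_id", "sender", "body"]},
}


def parse_message(forum, raw, config=FORUM_CONFIG):
    """Split one raw message line into named fields using the forum's config."""
    spec = config[forum]
    fields = spec["fields"]
    # Limit the number of splits so the final field (the body) may itself
    # contain the delimiter.
    parts = raw.split(spec["delimiter"], len(fields) - 1)
    if len(parts) != len(fields):
        raise ValueError(f"unexpected format for {forum!r}: {raw!r}")
    return dict(zip(fields, parts))


print(parse_message("forum_a", "2000-10-11|42|alice|Great movie | really"))
print(parse_message("forum_b", "43\tbob\tNot for me"))
```

Because the collector reads only the configuration table, a change in a forum's message layout requires editing one entry rather than the collector code, which mirrors the single point of modification described above.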

To ensure compatibility with various computer systems, the message collector programs are written utilizing any suitable programming languages, preferably the Java programming language and the Python scripting language. This allows the collector programs to be easily ported across a wide variety of computer operating systems. Moreover, the message collector programs are designed to have a minimal processing footprint so that they can reside on computer systems that are hosting other critical functions.

As noted above, there are several ways to implement the architecture supporting message collection subsystem 12. In one implementation, message collector programs, shown in FIG. 2 as local message collectors 123 a and 123 b, are part of local area network (“LAN”) 124 and are authorized access through firewall 122. Local message collector 123 a interfaces through network 41 to collect messages from discussion forum 61 and local message collector 123 b has direct access to discussion data file 63. The latter configuration may be implemented, e.g., if the operator of message collection subsystem 12 also hosts a community for message discussion forums. As shown in FIG. 2, a message collector may collect messages from multiple discussion forums. For example, as shown in FIG. 2, local message collector 123 b also interfaces through network 41 to collect messages from discussion forum 61.

In an alternative implementation, message collector programs, such as remote message collectors 125 a and 125 b, are run on external networks. As shown in FIG. 2, the remote message collectors are not part of LAN 124 and do not have direct access to the message processor programs running behind firewall 122. For security reasons, proxy servers 126 a and 126 b are used to interface with message processor 121 b through firewall 122. Functionally, remote message collectors operate in the same manner as the local message collectors. That is, remote message collectors 125 a and 125 b receive configuration information from central data store 20 (via proxy servers 126 a and 126 b, respectively). Moreover, remote message collectors may collect messages from discussion forums over a network or directly from discussion data files, as shown in FIG. 2. Use of remote message collectors allows for geographic distribution and redundancy in the overall message collection subsystem architecture.

2. Message Categorization Subsystem

Message categorization subsystem 14 analyzes the data collected from discussion forums and categorizes the messages into meaningful groupings, i.e., parent topics and topics, according to predefined rules as described below. In a preferred embodiment, message categorization subsystem 14 retrieves message information from database 22 and topic information from central data store 20 and stores results of the categorization process in database 22. Alternatively, message categorization subsystem 14 may receive input directly from message collection subsystem 12 for immediate processing into categories.

Topics database 23 comprises representations of real world topics that are being tracked and analyzed by the system and method of the present disclosure. FIG. 3 shows the hierarchical data structure used in a preferred embodiment of database 23. In a preferred embodiment, abstract root 231, shown in FIG. 3 as the top level of the hierarchy, is not an actual topic stored in database 23 and is shown only to illustrate the hierarchy. Similarly, branches 232-234 are shown in FIG. 3 to conceptually show the relationship between topics stored in database 23. Accordingly, branch 232 indicates that some topics stored in database 23 may relate to consumer entertainment, branch 233 indicates other topics relate to stock markets, and branch 234 may include other topics, such as, e.g., food, sports, technology adoption, and the like. As shown in FIG. 3, the hierarchy comprises one or more parent topics, such as parent topic 235 (related to books), parent topic 236 (related to movies), parent topic 237 (related to market indexes) and parent topic 238 (related to companies). Topics are the last level in the hierarchy, such as topic 235 a (Tears of the Moon), topic 235 b (The Indwelling), topic 235 c (Hot Six) and topic 235 d (The Empty Chair). As shown in FIG. 3, topics 235 a-235 d are related to each other by parent topic 235 (books).

In a preferred embodiment of the present disclosure, message categorization subsystem 14 assigns a relevance ranking for each topic to each message collected by message collection subsystem 12. The relevance ranking is determined based on a set of predefined rules stored in database 23 for each topic. The rules comprise a series of conditions defining information relevant to the topic, having an associated weighting to indicate the strength a particular condition should have in determining the overall relevance rank of the message with respect to the topic. Messages that need categorization are processed by message categorization subsystem 14 synchronously, i.e., the rules for each topic are applied to each message regardless of the relevance ranking for prior topics. The elements of each message, including subject, source, and content are processed against the conditions of each topic in the database. Based on the conditions that are satisfied and the weights of those conditions, a relevance rank for each topic is assigned to each message. As messages are processed, their relevance ranking for each topic is updated in message information database 22 in central data store 20.

An example of the rules which may be processed by message categorization subsystem 14 is presented in Table 1, below. In this example, the topic is “The Perfect Storm” which, as shown in FIG. 3, is under the parent topic “Movies”. The conditions for determining the relevance ranking for each message are shown in Table 1, below.

TABLE 1

Condition                                                             Weight
Message originated from Yahoo movies discussion forum                 10
Message subject contains “The Perfect Storm”                          90
Message subject contains “Perfect Storm”                              80
Message body contains “The Perfect Storm”                             50
Message body contains “The Perfect Storm” and “George Clooney”        90
Message body contains “Warner Brothers” and “Barry Levinson”          75

The number, nature, and weights of the conditions used to determine the relevancy ranking for each topic depend on the nature of the topic itself. The accuracy of the relevancy ranking assigned can be increased by refining the conditions and weights after analysis of the results obtained by the system. For example, analysis of the results in the above example may show that an additional condition, such as “Message originated from Yahoo movies discussion forum and message subject contains ‘Perfect Storm’”, should be included in the rules and have a weight of 99. If subsequent analysis provides refined rules, message categorization subsystem 14 may be re-run against each message in database 22 to update the relevancy rankings, if desired.
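By way of illustration only, the conditions and weights of Table 1 can be expressed as a small rule set applied to each message. The disclosure does not state how the weights of satisfied conditions are combined; the sketch below assumes the relevance rank is the largest weight among the satisfied conditions. The class, function and source names are hypothetical, and the matching is a simple case-insensitive substring test.

```python
from dataclasses import dataclass


@dataclass
class Message:
    source: str
    subject: str
    body: str


def contains(text, *phrases):
    """True if every phrase occurs in text (case-insensitive)."""
    lowered = text.lower()
    return all(p.lower() in lowered for p in phrases)


# (condition, weight) pairs mirroring Table 1 for the topic "The Perfect Storm".
RULES = [
    (lambda m: m.source == "Yahoo movies", 10),
    (lambda m: contains(m.subject, "The Perfect Storm"), 90),
    (lambda m: contains(m.subject, "Perfect Storm"), 80),
    (lambda m: contains(m.body, "The Perfect Storm"), 50),
    (lambda m: contains(m.body, "The Perfect Storm", "George Clooney"), 90),
    (lambda m: contains(m.body, "Warner Brothers", "Barry Levinson"), 75),
]


def relevance_rank(message, rules=RULES):
    """Assumed combination rule: highest weight among satisfied conditions."""
    return max((weight for cond, weight in rules if cond(message)), default=0)


msg = Message("Yahoo movies", "Perfect Storm review", "Worth seeing on a big screen.")
print(relevance_rank(msg))
```

Keeping the rules as data rather than code makes the refinement step described above a matter of editing the list, for instance by appending a combined source-and-subject condition with weight 99, and then re-running the ranking.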

3. Opinion Rating Subsystem

Opinion rating subsystem 16 extracts message information from database 22 in central data store 20 and assigns an opinion rating for each message by analyzing textual patterns in the message that may express an opinion. The textual patterns are based on linguistic analysis of the message information. For example, if the message body includes words such as “movie” and “awful” in the same sentence or phrase and the message had a high relevancy ranking for the topic “The Perfect Storm”, the message may be expressing a negative opinion about the movie. Textual pattern analysis software, such as that available from Verity Inc., of Mountain View, Calif., may be used to assign the opinion rating for each message. Such passive opinion polling is useful for market analysis without the need for individually interviewing active participants in a survey. Once the rating process is complete, the rating for each opinion processed is stored in database 22 in central data store 20.
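By way of illustration only, the same-sentence co-occurrence idea in the example above can be sketched with a few word lists. This is not the commercial textual pattern analysis software mentioned in the text; the word lists, the sentence splitting and the scoring are all assumptions chosen for brevity.

```python
import re

# Hypothetical word lists for a movie topic.
TOPIC_WORDS = {"movie", "film"}
POSITIVE = {"great", "excellent", "loved"}
NEGATIVE = {"awful", "terrible", "boring"}


def opinion_rating(body):
    """Rate a message body as positive, neutral or negative.

    Opinion words count only when they share a sentence with a topic word.
    """
    score = 0
    for sentence in re.split(r"[.!?]+", body.lower()):
        words = set(re.findall(r"[a-z']+", sentence))
        if words & TOPIC_WORDS:
            score += len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(opinion_rating("The movie was awful."))
print(opinion_rating("I loved this film! The popcorn was awful."))
```

In the second call the word "awful" is ignored because its sentence contains no topic word, which is the distinction the same-sentence condition is meant to capture.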

Data Analysis System

Data analysis system 30 comprises objective data collection subsystem 32, analysis subsystem 34 and report generation subsystem 36, as shown in FIG. 1. The overall goal of data analysis system 30 is to identify and predict trends in actual markets based on the electronic discussion data being posted to various electronic discussion forums and to provide reports for end-users 9 of the system and method of the present disclosure.

1. Objective Data Collection Subsystem

Objective data collection subsystem 32 collects objective data from both traditional and electronic sources and stores the information in database 24 on central data store 20 for later analysis. Objective data sources 8, shown in FIG. 1, may include, for example, market data such as box office sales for recently released movies, stock market activity for a given period, television viewer market share (such as Nielsen ratings), and other such objective data. The specific data collected from each objective data source depends on the nature of the market being analyzed. For example, objective data on the stock market may include: a company's name; its Web home page address, i.e., universal resource locator; ticker symbol; trading date; opening price; high price; low price; closing price and volume. In other markets, the objective data may include: sales, measured in units sold and/or revenue generated; attendance at events; downloads of related software and media files; press release date, time and key words; news event date; and the like. The objective data is used by analysis subsystem 34 to identify and predict trends and correlation between real world events and electronic discussion data, as described below.

2. Analysis Subsystem

Analysis subsystem 34 performs analysis of the information collected by the message collection subsystem 12 and objective data collection subsystem 32, and the categorization and opinion information determined by message categorization subsystem 14 and opinion rating subsystem 16, respectively. Analysis subsystem 34 determines the existence of any correlation between discussion forum postings and market activity for each topic that the system is currently tracking. The results of the analysis are stored in the analysis database 26 in central data store 20 for eventual presentation to end-users 9. Analysis subsystem 34 examines the internal behavior of communities and correlates individual and group behavior to the world external to the communities using a variety of analysis techniques with a variety of goals. Analysis subsystem 34 identifies and categorizes actors by measuring the community's response to their postings; measures and categorizes the community's mood; correlates actors' behavior and the communities' moods with objective data sources; and forecasts the markets' behavior, with confidence estimates in various timeframes. Identifying and tracking both the actors and the community mood is important, because the effect of an actor's message depends in part on the mood of the community. For example, an already-nervous community may turn very negative if a sell signaler or other negative actor posts a message, while the same message from the same person may have little effect on a community in a positive mood. The following sections describe the patterns sought in the analysis and describe how the community behaves after postings by each local pseudonym associated with the patterns.

(a) Actor Classification

Actors are classified by correlating their postings with objective data, which is external to the electronic forum. Changes in the objective data (e.g., stock price changes, increased book sales, etc.) are tracked during several discrete short time periods throughout a longer time period, such as a day. A score is assigned to each pseudonym posting messages related to a given topic based on the change observed in the objective data from the preceding discrete time period. A pseudonym's score may be high, medium or low, depending on the magnitude of the change. For example, in a preferred embodiment, pseudonyms who tended to post messages just prior to major increases in stock price receive high positive scores, while those whose postings tended to precede major drops have the lowest negative scores. The scores assigned to a pseudonym during the longer time period are aggregated into a composite score for the pseudonym.
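The scoring described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the numeric thresholds (`major`, `minor`), the 3/2/1 score levels, and the convention that `changes[t]` holds the percentage change in the objective data from period t to period t+1 are all assumptions made for the example.

```python
from collections import defaultdict

def score_change(pct_change, major=2.0, minor=0.5):
    """Map a percentage change to a signed high/medium/low score.

    The thresholds and the 3/2/1 levels are hypothetical values
    chosen only for illustration.
    """
    magnitude = abs(pct_change)
    if magnitude >= major:
        level = 3  # high
    elif magnitude >= minor:
        level = 2  # medium
    else:
        level = 1  # low
    if pct_change > 0:
        return level
    if pct_change < 0:
        return -level
    return 0

def composite_scores(posts, changes):
    """Aggregate per-post scores into one composite score per pseudonym.

    posts   -- iterable of (pseudonym, period_index) pairs
    changes -- changes[t] is the percentage change in the objective
               data from period t to period t + 1 (assumed convention)
    """
    totals = defaultdict(int)
    for pseudonym, period in posts:
        if 0 <= period < len(changes):
            totals[pseudonym] += score_change(changes[period])
    return dict(totals)
```

A pseudonym that repeatedly posts just before large rises accumulates a high positive composite score under this sketch, and one that posts before large drops accumulates a strongly negative one.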

As discussed in the definitions sections above, actors can be classified as an initiator if the actor tends to post the first message leading to subsequent responses forming a dialog on a particular subject. Similarly, an actor tending to post the final message closing the dialog on a particular subject is classified as a moderator.

Two of the more interesting classifications made by analysis subsystem 34 identify buzz accelerators and buzz decelerators. Because of the correlation identified in some markets between the level of discussion in a community and the objective, real-world events, identification of buzz accelerators and decelerators can be used to predict the probable outcome of real-world events. For example, if a local pseudonym were identified as a buzz accelerator for electronic discussion forums related to the stock market, whenever that local pseudonym posts a message to such a forum, one would expect a rise in the discussion level, and the correlating drop in stock prices. Related, but not synonymous, classes of actors are buy signalers and sell signalers. Such actors tend to post messages at a time preceding a rising or falling market for that topic. In contrast to buzz accelerators or decelerators, buy and sell signalers do not necessarily tend to reflect or precede rising levels of electronic discussion on the forums.

The final three classes of local pseudonyms are manipulators, provokers and connectors. As noted in the definition sections, a manipulator is a local pseudonym with little posting history except as manipulators, whose combined postings on one topic elevate the buzz level in the absence of external confirming events. Such actors may be attempting to obscure analysis or to sway the markets being analyzed. As such, identifying and tracking manipulators is important for ensuring validity of the results output by analysis subsystem 34. Provokers are local pseudonyms that tend to start longer discussion threads, which may contribute to a community's overall discussion level, but this is not indicative of a rise in discussion level for the community. Again, identification and tracking of provokers allows better results in the analysis of electronic discussion information. Finally, a connector is a local pseudonym that posts on a high number of topics or a high number of communities.

Analysis subsystem 34 tracks and observes the behavior characteristic of the pseudonyms posting messages to electronic discussion forums and assigns a reputation score indicating their categorization. In a preferred embodiment, the reputation score comprises an array of ratings for each of the possible categorizations. From the reputation score, composite views of the tendencies of the pseudonyms can be formed to graphically illustrate the pseudonym's reputation in a given community. An example of one such composite view is shown in FIG. 4, wherein a pseudonym's reputation as a buzz accelerator/decelerator is plotted against its reputation as a buy/sell signaler. As shown in FIG. 4, pseudonym A has a strong tendency as a buy signaler and is a buzz accelerator, but not a strong buzz accelerator. In contrast, pseudonym B has strong tendencies as both a sell signaler and a buzz decelerator in the market. The impact of the classifications depends, of course, on the market involved, as discussed previously.

(b) Community Mood

As discussed above, pseudonyms' classifications are useful to the extent they can quantify the tendencies of the various actors in a community. However, the impact of such actors on the community depends not only on the tendencies of the actors, but on the overall mood of the community. The measure of a community's mood is determined from the change in discussion levels in the community. The mood assigned is based on observed trends for the associated topic. For example, when discussion levels rise in stock market forums, the rise is usually accompanied by a drop in stock market prices due to increased selling activity, indicating a negative mood in the community. Similarly, an increase in discussion levels for a movie topic may indicate a generally positive mood for the community. Other indicators of community mood include the number of new participants in a community, which correlates to an increased interest in the community's topic. Moreover, the combined positive and negative influence scores of actors in a community are an indicator of the overall sentiment. Another factor indicating a community's mood is its turnover rate, i.e., the number of new participants versus the number of old participants, which indicates the depth of interest in the community's topic.

The combined provocation-moderation scores of active participants are expected to be a forecaster of the community's near-term discussion level.

The ratio of message volume to external volume (stock trading volume in the prototype) will be explored as an indicator of confidence for other forecasts.

The number of active discussion threads, relative to the number of participants, is an indicator whose significance we plan to explore. “Flame wars,” for example, are typically carried out by a small number of people generating a large volume of messages.

The ratio of “on-topic” to “off-topic” messages, which we expect to be able to measure via linguistic analysis, is an indicator whose significance we plan to explore.

Co-occurrence of topics within a community, also measurable via linguistic analysis, is an indicator of shared interests among communities, whose significance we plan to explore.

(c) Algorithms and Modeling

As discussed above, the analysis system uses patterns in message postings to identify community moods and opinion leaders, i.e., those pseudonyms whose postings can be correlated to changes in the market and/or forum discussion levels. Linguistic analysis extends this analysis by showing and summarizing the subjects under discussion and reveals attitudes toward the topics discussed. The linguistic analysis used in the present disclosure is not intended to explicitly identify any individual's attitude toward a given topic; rather the overall attitude of the community is assessed.

The analysis system relies on the inherent repeated patterns in discussions that yield accurate short-term forecasts. The existence of such repeated patterns is known in the art, and can be explained with reference to three areas of research into social networks. Chaos and complexity theories have demonstrated that large numbers of agents, each of whom interacts with a few others, give rise to repeating patterns by virtue of simple mathematics. Social network theory grounds mathematical models in human behavior. Computer-mediated communications research applies the mathematical models to “new media” technologies including the Internet.

As with any high-frequency, high-volume data mining challenge, the number of potential variables is enormous and the applicable techniques are many. To simplify this problem, the system and method of the present disclosure reduces the data sets as much as possible before analysis. Accordingly, on the assumption that there are a very small number of opinion leaders relative to participants, the vast majority of participants whose postings did not occur near objective data inflection points, i.e., sharp changes in the objective data, are eliminated. This greatly reduces the amount of data that is further analyzed by the system and method of the present disclosure. The period of time over which inflection points are identified has a great impact on which patterns can be identified and on the usefulness of the resulting data. For example, stock price movement and other markets are known to have fractal patterns, so they have different inflection points depending on the time frame chosen. Accordingly, different inflection points will be identified if the period is weekly, monthly, or yearly. The more volatile a market is, the more inflection points can be found.

The following sections describe the various types of analyses used in a preferred embodiment of analysis subsystem 34.

Statistical Analysis

Histograms divide scores into “bins” that show the distribution across the range of values. Histograms of the positive/negative influence scores, as well as the provoker/moderator scores described above, are used to select statistically significant pseudonyms at the outlying ends of the normal distribution curve. A database query can then calculate the ratio of these opinion leaders who have posted in the last X days. For example, if 25 of the top 50 “positives” and 10 of the top 50 “negatives” posted in the last two days, the ratio would be 2.5, indicating that positive market movement is more likely than negative.
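The ratio in the example above can be computed with a few lines of code. The sketch below is illustrative only: the function name, the `last_post_day` mapping, and the use of integer day numbers are assumptions, and the text describes a database query rather than in-memory code.

```python
def leader_ratio(positive_leaders, negative_leaders, last_post_day,
                 today, window_days):
    """Ratio of recently active positive to negative opinion leaders.

    positive_leaders / negative_leaders -- pseudonyms at the outlying
        ends of the score distribution (e.g., top 50 of each)
    last_post_day -- hypothetical mapping: pseudonym -> day number of
        that pseudonym's most recent post
    Returns None when no negative leader was active, since the ratio
    is undefined in that case.
    """
    def active(leaders):
        return sum(
            1 for name in leaders
            if name in last_post_day
            and today - last_post_day[name] <= window_days
        )

    positives = active(positive_leaders)
    negatives = active(negative_leaders)
    if negatives == 0:
        return None
    return positives / negatives
```

With 25 of 50 positive leaders and 10 of 50 negative leaders active inside the window, the function returns 2.5, matching the figure in the text.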

Fourier Analysis

Fourier analysis is a well-established technique, with many variations, for breaking down a complex waveform, such as plots of discussion levels, into component waves. This makes it possible to subtract regularly occurring waves, such as increased or decreased discussion levels on weekends, in order to isolate the movements that signal meaningful events.
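As a hedged illustration of subtracting a regularly occurring wave, the sketch below removes a weekly cycle from a daily discussion-level series using a naive discrete Fourier transform. The function names are hypothetical, the series length is assumed to be a whole number of weeks, and a production system would more likely use an FFT library; none of this is specified in the disclosure.

```python
import cmath

def dft(values):
    """Naive discrete Fourier transform (O(n^2), fine for short series)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n)
            for t in range(n))
        for k in range(n)
    ]

def inverse_dft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]

def remove_weekly_cycle(daily_levels, period=7):
    """Zero the frequency bins of a repeating cycle and rebuild the series.

    A pattern repeating every `period` samples only occupies bins that
    are multiples of n / period, so those bins (except bin 0, the mean)
    are cleared before transforming back.
    """
    n = len(daily_levels)
    if n == 0 or n % period != 0:
        raise ValueError("series length must be a whole number of periods")
    step = n // period
    spectrum = dft(daily_levels)
    for k in range(step, n, step):
        spectrum[k] = 0
    return inverse_dft(spectrum)
```

Applied to a series that is purely a repeating weekday/weekend pattern, the result is flat at the series mean; a one-off spike added to that series remains the largest value in the output, which is the kind of movement the analysis aims to isolate.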

On Balance Volume

On Balance Volume (OBV) uses stock trading volume and price to quantify the level of buying and selling in a security. In a preferred embodiment of the present disclosure, OBV is used, e.g., by substituting the number of discussion participants for the stock volume. In this context, OBV is a negative indicator, i.e., when it is rising, price tends to fall; when it falls, price tends to rise.
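A minimal sketch of the substitution described above follows. It uses the conventional OBV rule (add the period's volume when the price closes higher, subtract it when the price closes lower) with participant counts in place of trading volume. The function and parameter names are assumptions, as is starting the running total at zero.

```python
def on_balance_volume(prices, participants):
    """Running OBV total with participant counts standing in for volume.

    prices       -- closing price per period
    participants -- number of discussion participants per period
    """
    if len(prices) != len(participants):
        raise ValueError("prices and participants must have equal length")
    if not prices:
        return []
    obv = [0]  # assumed starting value for the running total
    for i in range(1, len(prices)):
        if prices[i] > prices[i - 1]:
            obv.append(obv[-1] + participants[i])
        elif prices[i] < prices[i - 1]:
            obv.append(obv[-1] - participants[i])
        else:
            obv.append(obv[-1])  # unchanged price leaves OBV unchanged
    return obv
```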

Moving Average Convergence-Divergence

Moving Average Convergence-Divergence (MACD) is a technical analysis technique that may be applied to the discussion levels in the communities. MACD generates signals by comparing short-term and long-term moving averages; the points at which they cross one another can be buy or sell signals, depending on their directions. MACD can signal when a community's discussion level rises above the recent averages, which is often an indicator of rising nervousness.
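The crossover detection described above might be sketched as follows. This example compares a short and a long exponential moving average of the discussion level and reports where they cross. The 12- and 26-period defaults are the spans commonly used for MACD on price data, not values taken from the disclosure, and the signal-line smoothing used in many MACD implementations is omitted for brevity.

```python
def ema(values, span):
    """Exponential moving average seeded with the first value."""
    alpha = 2 / (span + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(out[-1] + alpha * (value - out[-1]))
    return out

def macd_crossovers(levels, short_span=12, long_span=26):
    """Return (index, direction) pairs where the short EMA crosses the long EMA.

    "up" means the short-term average moved above the long-term one,
    i.e., discussion is rising above its recent average.
    """
    if not levels:
        return []
    short = ema(levels, short_span)
    long_ = ema(levels, long_span)
    diff = [s - l for s, l in zip(short, long_)]
    signals = []
    for i in range(1, len(diff)):
        if diff[i - 1] <= 0 < diff[i]:
            signals.append((i, "up"))
        elif diff[i - 1] >= 0 > diff[i]:
            signals.append((i, "down"))
    return signals
```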

Link Analysis

In one embodiment of the present disclosure an “80/20 rule,” supported by social network research, is used wherein only the 20 percent of participants whose posts are “closest” (in time) to significant objective data inflection points are analyzed. While this method simplifies the task of analyzing the data, there is some risk that opinion-leading groups may be overlooked. Such groups comprise individuals that do not consistently post at the same time, but as a group exhibit the characteristics of individual opinion leaders. For example, it is possible Bob, Sam and George form a positive opinion leader group, i.e., when any one of them posts a message, prices tend to rise. Data mining link analysis tools are used to explore for these kinds of relationships and to identify groups of pseudonyms whose behavior as a group exhibits predictive patterns.
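The 20 percent selection step could be approximated as shown below. The disclosure does not say how “closest” is measured, so this sketch assumes each pseudonym is ranked by the smallest time gap between any of its posts and any inflection point; the function name and the data layout are likewise hypothetical.

```python
def closest_fraction(post_times, inflection_times, fraction=0.2):
    """Keep the fraction of pseudonyms posting nearest to inflection points.

    post_times       -- mapping: pseudonym -> non-empty list of post times
    inflection_times -- non-empty list of inflection-point times
    Closeness is the minimum absolute gap between any post and any
    inflection point (an assumption made for this sketch).
    """
    def distance(times):
        return min(abs(t - p) for t in times for p in inflection_times)

    # Sort by distance, then by name so that ties are resolved predictably.
    ranked = sorted(post_times,
                    key=lambda name: (distance(post_times[name]), name))
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```

Because each pseudonym is ranked individually, a group such as Bob, Sam and George, whose members only matter collectively, can fall outside the retained 20 percent; that is the risk the paragraph above attributes to this simplification.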

Geographic Visualization

Tools for geographic visualization display the distribution of information on a map. Although geographic location is unknown for many of the pseudonyms being monitored, it is available for some of them and will be tracked as the information becomes available. This analysis allows monitoring of the awareness of a topic, such as a newly released consumer media device, as it spreads throughout the United States and other countries. This analysis will help marketers decide where promotional and advertising budgets can be spent most effectively. Marketing experience and the mathematics of social networks predict that awareness follows a stair-step pattern. The analysis results of the present disclosure can be used to identify these plateaus very early, allowing marketers to cut spending earlier than they otherwise would.

Clustering

Cluster analysis allows discovery of groups of local pseudonyms that “travel in the same circles.” For example, there may be a group of 20 local pseudonyms that tend to participate in discussions on five topics. This cluster of shared interests is a means of automatically discovering that there is some kind of relationship among the five topics. In the financial market, it implies that people who are interested in any one of the five companies are likely to find the other four interesting. Presenting these as recommendations is a form of collaborative filtering, because it helps the user select a few new topics of interest out of thousands of possibilities. The most significant aspect of this analysis is that the computer system needs no knowledge of why the topics are related; the system can therefore discover new relationships.
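To make the idea concrete, the sketch below counts how many pseudonyms two topics have in common and turns those counts into recommendations. It is a simple co-occurrence count standing in for a full cluster analysis, which the disclosure does not specify; the function names, the `min_shared` threshold, and the data layout are all assumptions.

```python
from collections import Counter
from itertools import combinations

def related_topics(topics_by_pseudonym, min_shared=2):
    """Find topic pairs discussed by at least `min_shared` pseudonyms.

    topics_by_pseudonym -- mapping: pseudonym -> iterable of topics
    Returns a dict mapping (topic_a, topic_b), in sorted order, to the
    number of pseudonyms active in both topics.
    """
    pair_counts = Counter()
    for topics in topics_by_pseudonym.values():
        for pair in combinations(sorted(set(topics)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_shared}

def recommend(topic, related):
    """List topics related to `topic`, strongest overlap first."""
    scores = Counter()
    for (a, b), shared in related.items():
        if a == topic:
            scores[b] += shared
        elif b == topic:
            scores[a] += shared
    return [t for t, _ in sorted(scores.items(),
                                 key=lambda kv: (-kv[1], kv[0]))]
```

Note that nothing in the code knows why two topics are related; the relationship emerges only from who posts where, which is the property the paragraph above highlights.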

Regression

Regression analysis is a well-known method of correlating sets of data. Regression is the most fundamental means for identifying if the patterns in communities have a positive, negative or insignificant correlation to external events.
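A minimal least-squares example is shown below, assuming one community measure (for instance, daily discussion level) is paired with one external measure (for instance, daily closing price). The function name and the pairing are illustrative; the disclosure does not prescribe a particular regression form.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line plus the Pearson correlation coefficient.

    Returns (slope, intercept, r). The sign of r shows whether the
    correlation is positive or negative; a value near zero suggests
    an insignificant one.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with 2+ points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r
```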

Neural Networks and Genetic Algorithms

Neural networks and genetic algorithms are machine-learning approaches for finding optimal solutions to complex problems. Neural nets take a set of inputs, which might be various parameters about a community, such as message level, ratio of positive to negative opinion leaders, etc., and discover relative weightings to achieve a desired outcome, such as a predicted stock price. Neural nets have been used successfully in other types of financial forecasting and analysis. Genetic algorithms evolve solutions to complex problems by imitating the competitive nature of biological genetics. Factors under consideration must be encoded in a binary form and a system for ranking the value of the outcome is created. Software applications used to perform such analyses in the present disclosure are commercially available from, e.g., Ward Systems Group, Inc. of Frederick, Md.

3. Report Generation Subsystem

Report generation subsystem 36 extracts the results of the analysis performed by analysis subsystem 34 for presentation to end-users 9. In a preferred embodiment, report generation subsystem 36 presents the results to end-users via a Web-based user interface. In this embodiment, the reports are published using a variety of formats, such as, e.g., PDF, HTML, and commercially available spreadsheets or word processors, and the like. End-users 9 may use any suitable Web browser to view and receive the reports generated by report generation subsystem 36. Examples of such Web browsers are available from Netscape, Microsoft, and America Online. In an alternative embodiment, report generation subsystem 36 presents the results in written reports that may be printed and distributed.

Report generation subsystem 36 produces and displays some reports automatically and other reports may be specifically requested by end-users 9. For example, in a preferred embodiment, dynamic content boxes are automatically generated and displayed via a Web server. Such dynamic content boxes may include a report on the current market mood, displaying a visual indicator for the NASDAQ 100, for example. Such a market mood graph may contain the NASDAQ 100 market mood over the last 1 year together with the closing price of the NASDAQ 100 for the same period. Another dynamic content box could, e.g., display the top five companies where activity is spiking the greatest over the last 1 day versus activity recorded over the last 10 days. Alternatively, the dynamic content box could display the top five companies that are being discussed by the top five buy signalers. Other such reports can be generated and displayed automatically such that when end-users 9 connect to the Web server, the reports are presented without the need for requesting the information.
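The “top five spiking companies” content box could be driven by a calculation like the one sketched here. It assumes that a spike is measured as the most recent day's message count divided by the average daily count over the last 10 days; the disclosure does not define the measure, and the function name and data layout are hypothetical.

```python
def top_spikes(daily_counts, top_n=5, baseline_days=10):
    """Rank topics by last-day activity relative to a recent baseline.

    daily_counts -- mapping: topic -> list of daily message counts,
                    oldest first
    The spike ratio (last day / mean of the last `baseline_days` days)
    is an assumed definition used only for this sketch.
    """
    ratios = []
    for topic, counts in daily_counts.items():
        window = counts[-baseline_days:]
        if not window:
            continue  # no history for this topic
        baseline = sum(window) / len(window)
        if baseline > 0:
            ratios.append((counts[-1] / baseline, topic))
    # Highest ratio first; ties are broken alphabetically by topic.
    ratios.sort(key=lambda item: (-item[0], item[1]))
    return [topic for _, topic in ratios[:top_n]]
```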

Other reports that may be generated by report generation subsystem 36 include, for example, a list of the most recent subjects posted by the top buy signaler for each of the top five most positive market mood companies and real-time trends such as information about postings to Internet based communities. These reports and others may be dynamically built by report generation subsystem 36 based on requests for information from end-users 9. For example, end-users 9 may specify a community, a pseudonym or a topic about which detailed information can be presented. For example, if an end-user requests a report concerning pseudonyms meeting certain criteria, report generation subsystem 36 executes a search of all matching pseudonyms together with the source of the pseudonym (Yahoo, Raging Bull, etc.) and with links to a profile page for each pseudonym.

A pseudonym's profile page comprises another report generated by subsystem 36 and includes, e.g., the pseudonym and its source; an e-mail address of the pseudonym at the community, if one exists; the total number of posts that the pseudonym has made in discussion groups that are being tracked; the number of different topics that the pseudonym has posted to in discussion groups that are being tracked; the most recent posting date that the pseudonym has made to any discussion group and a link to that posting; a list of most recent postings to discussion groups categorized by topics; the pseudonym's reputation score for each category; a graphical representation of the pseudonym's reputation (e.g., FIG. 4); and the like.

In addition to retrieving reports concerning particular pseudonyms, report generation subsystem 36 allows end-users 9 to locate detailed information about each topic (company, book, movie, etc.). For example, if an end-user requests a report on a particular company, by e.g., the stock symbol or the company name, another search is executed. Report generation subsystem 36 displays information such as a list of all matching companies; the name of the company; the stock symbol of the company; and a link to a company profile page where users can obtain detailed information about that particular company.

A company profile is similar to a pseudonym's profile page. That is, the company profile page is another report generated and displayed by report generation subsystem 36. In a preferred embodiment, the company profile page comprises detailed information about a particular company, especially information that relates to postings in stock message forums for that company. Other information that may be displayed indicates, e.g., the name of the company; the stock exchange that the company is a member of; the domain name for the company's home page and a link; a link to the company's stock board on Yahoo, Raging Bull, Motley Fool or other prominent electronic discussion forums; a list of the most frequent posters on the company's stock discussion groups; the top buzz accelerators and the top buzz decelerators for the company's stock discussion groups; and top buy and sell signalers for the company's stock discussion groups.

For other topics, analogous profile pages can be presented. For example, a movie's profile page may comprise the movie's name, the producer, and other objective information as well as identification of the top buzz accelerators and decelerators, and other results output by analysis subsystem 34.

Pseudonym Registration System

As shown in FIG. 5, embodiments of the present disclosure may include pseudonym registration system 40. Pseudonym registration system 40 allows end-users, such as end-users 41, to sign up (or register) for pseudonym services. The services include creation of pseudonyms for use in posting messages to electronic discussion forums; the capability to build a reputation in a community through persistent pseudonym identity; and opt-in marketing services (wherein pseudonyms can be registered to receive selected categories of marketing information). For example, an end-user can register one pseudonym and specify an interest in comic books, and register another pseudonym with an interest in stock market forecasts. Although the two pseudonyms belong to the same person, the person can more easily differentiate and select the type of information sought at a particular moment. Moreover, registration with pseudonym registration system 40 provides a means for end-users 41 to provide certain demographic information (age, gender, salary, and the like) without revealing their actual identity.

In a preferred embodiment, pseudonym registration system 40 provides a digital signature that registered pseudonyms may use to prove their identity as a registered pseudonym. The digital signature allows one pseudonym to be linked to other pseudonyms, which may be important to establish a reputation across multiple communities. For example, if an end-user having a pseudonym of john@yahoo.com on the Yahoo message boards wishes to post messages on the Amazon.com message boards, it is very likely that the pseudonym john@amazon.com will already be taken by another individual. In this case, the end-user would have to select a different pseudonym for use on the Amazon message boards, for example, john2@amazon.com. In this case, the end-user can register both pseudonyms with pseudonym registration system 40 and indicate that they belong to the same end-user. When posting messages under either pseudonym, the end-user authenticates his or her identity by providing the digital signature in the message. When other participants in the community see the digital signature, they can verify that the end-user john@yahoo.com is the same end-user john2@amazon.com by checking pseudonym registration system 40.

Pseudonym registration system 40 is a useful addition to the overall operation of the system and method of the present disclosure. By allowing end-users to register their pseudonyms, the data collected and analyzed may have more points for correlation. End-users are benefited both by better analysis results and by more control over their personal identifying information.

The foregoing disclosure of embodiments of the present disclosure has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many variations and modifications of the embodiments described herein will be obvious to one of ordinary skill in the art in light of the above disclosure. The scope of the disclosure is to be defined only by the claims appended hereto, and by their equivalents.

1. A method to analyze electronic messages, the method comprising: determining a relevancy ranking for each message in a received set of electronic messages, wherein the relevancy ranking indicates whether each message is relevant to a movie; determining an opinion expressed in each message; and computing a prediction of the success of the movie based on the determined opinion for each message.
2. A method as defined in claim 1, further comprising comparing a test set of electronic messages about a second movie to data related to the success of the second movie to determine a correlation between the test set of electronic messages and the success of the second movie.
3. A method as defined in claim 1, wherein computing the prediction of the success utilizes the correlation.
4. A method as defined in claim 1, wherein determining the relevancy ranking includes determining that the electronic message is relevant to movies before determining the relevancy ranking of the electronic message for the movie.
5. A method as defined in claim 1, wherein computing the prediction of the success comprises determining at least one of a predicted attendance for the movie, a predicted numbers of sales for the movie, or a predicted revenue for the movie.
6. A method as defined in claim 1, wherein computing the prediction of the success comprises determining a predicted number of downloads of the movie.
7. A method as defined in claim 1, wherein determining the relevancy ranking includes applying a set of rules having weights to the electronic messages.
8. A method as defined in claim 7, wherein one of the rules includes a condition related to a source of the electronic messages.
9. A method as defined in claim 1, wherein one of the rules includes a condition related to words of at least one of a subject or a body of the electronic messages.
10. A method as defined in claim 1, further comprising determining a discussion level for the movie, wherein determining the prediction of the success of the movie is also based on the discussion level.
11. A computer readable medium storing instructions that, when executed, cause a machine to: determine a relevancy ranking for each message in a received set of electronic messages, wherein the relevancy ranking indicates whether each message is relevant to a movie; determine an opinion expressed in each message; and compute a prediction of the success of the movie based on the determined opinion for each message.
12. A method as defined in claim 1, further comprising comparing a test set of electronic messages about a second movie to data related to the success of the second movie to determine a correlation between the test set of electronic messages and the success of the second movie.
13. A method as defined in claim 1, wherein computing the prediction of the success utilizes the correlation.
14. A method as defined in claim 1, wherein determining the relevancy ranking includes determining that the electronic message is relevant to movies before determining the relevancy ranking of the electronic message for the movie.
15. A method as defined in claim 1, wherein computing the prediction of the success comprises determining at least one of a predicted attendance for the movie, a predicted numbers of sales for the movie, or a predicted revenue for the movie.
16. A method as defined in claim 1, wherein computing the prediction of the success comprises determining a predicted number of downloads of the movie.
17. A method as defined in claim 1, wherein determining the relevancy ranking includes applying a set of rules having weights to the electronic messages.
18. A method as defined in claim 7, wherein one of the rules includes a condition related to a source of the electronic messages.
19. A method as defined in claim 1, wherein one of the rules includes a condition related to words of at least one of a subject or a body of the electronic messages.
20. A system to analyze electronic messages, the system comprising: a message categorization subsystem to determine a relevancy ranking for each message in a received set of electronic messages, wherein the relevancy ranking indicates whether each message is relevant to a movie; an opinion rating subsystem to determine an opinion expressed in each message; and an analysis subsystem to compute a prediction of the success of the movie based on the determined opinion for each message.